Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Improved Sparsity Handling #2341

Merged
merged 14 commits into from
Nov 20, 2017

Conversation

nikicc
Copy link
Contributor

@nikicc nikicc commented May 25, 2017

Issue

Fixes #2322.

Blocked on #2734. Blocked on #2736.

Description of changes
  • DomainConversion now has three extra attributes telling whether X, Y, or metas should be sparse according to the features inside of them. Metas can only be sparse if no StringVariables are present.
  • Table.from_table determines the density of X, Y, and metas according to domain conversion.
  • Add to_sparse and to_dense methods to Table.
  • Reset variables caches to alleviate interdependencies between test files.
  • Fix showing sparse metas in Table widget.
Includes
  • Code changes
  • Tests
  • Documentation

@nikicc nikicc added the DH2017 label May 25, 2017
@nikicc nikicc force-pushed the sparsity-from-domain-conversion branch 3 times, most recently from 4b65c4e to 80475e7 Compare May 25, 2017 11:29
@jerneju
Copy link
Contributor

jerneju commented May 31, 2017

That PR should fix #2359 .

Copy link
Contributor

@jerneju jerneju left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When someone wants to transform two columns from sparse X to dense Y she or he gets only one Y column which has got double the number of rows.

@jerneju
Copy link
Contributor

jerneju commented May 31, 2017

table = Table("iris")
table.X = sp.csr_matrix(table.X)
attributes = table.domain.attributes[:2] + table.domain.class_vars
class_vars = table.domain.attributes[2:4]
newtable = table.transform(Domain(attributes=attributes, class_vars=class_vars))
self.assertEqual(newtable.Y.shape[0], table.Y.shape[0])

@nikicc nikicc force-pushed the sparsity-from-domain-conversion branch from 125a142 to 71fc394 Compare June 1, 2017 14:37
@nikicc
Copy link
Contributor Author

nikicc commented Jun 1, 2017

@jerneju nice catch. Should be fixed not, please check.

@@ -463,6 +463,42 @@ def test_domain_conversion_is_fast_enough(self):

self.assertLessEqual(time() - start, 1)

def test_domain_conversion_sparsity(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually add something like this:

"""
Description.
GH-2341
"""

@nikicc nikicc added this to the 3.4.3 milestone Jun 2, 2017
@nikicc nikicc force-pushed the sparsity-from-domain-conversion branch 7 times, most recently from 942f527 to 120f92e Compare June 2, 2017 12:37
.. attribute:: sparse

A flag about sparsity of the variable. When set, the variable suggests
it is should be stored in a sparse matrix.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed 😄

@nikicc nikicc force-pushed the sparsity-from-domain-conversion branch 6 times, most recently from 4d20017 to c23bf8d Compare June 2, 2017 22:13
@codecov-io
Copy link

codecov-io commented Jun 2, 2017

Codecov Report

Merging #2341 into master will increase coverage by 0.01%.
The diff coverage is 95.45%.

@@            Coverage Diff            @@
##           master   #2341      +/-   ##
=========================================
+ Coverage   76.09%   76.1%   +0.01%     
=========================================
  Files         338     338              
  Lines       59723   59760      +37     
=========================================
+ Hits        45444   45479      +35     
- Misses      14279   14281       +2

@nikicc nikicc force-pushed the sparsity-from-domain-conversion branch 2 times, most recently from 8488448 to 68594d1 Compare June 2, 2017 22:58
@nikicc nikicc removed this from the 3.4.3 milestone Jun 3, 2017
@nikicc nikicc force-pushed the sparsity-from-domain-conversion branch 3 times, most recently from 67c4efb to 1d9c77a Compare November 10, 2017 09:59
@@ -63,6 +75,19 @@ def __init__(self, source, destination):
source.index(var) if var in source
else var.compute_value for var in destination.metas]

def is_sparse(feats):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should_be_sparse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That shoulds better 👍 Fixed.

@@ -280,45 +280,77 @@ def from_table(cls, domain, source, row_indices=...):

global _conversion_cache

def get_columns(row_indices, src_cols, n_rows, dtype=np.float64,
is_sparse=False):
def array_transform(x, to_sparse):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these functions be moved someplace else? They seem like general functions that ensure array/column is sparse/dense. Maybe change to a more expressive name.

def is_sparse(feats):
""" For a matrix to be stored in sparse, more than 2/3 of columns
should be marked as sparse and there should be no string columns
since Scipy's sparse matrices don't support dtype=object."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least the closing triple quotes of a multi-line docstring should be in their own line (PEP257).
(for opening quotes try to mimic other docstrings nearby)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to match the style as in nerby docstrings.

""" For a matrix to be stored in sparse, more than 2/3 of columns
should be marked as sparse and there should be no string columns
since Scipy's sparse matrices don't support dtype=object."""
fraction_sparse = sum(f.sparse for f in feats)/max(len(feats), 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space around operators

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed :)

@nikicc nikicc force-pushed the sparsity-from-domain-conversion branch from 1d9c77a to 20ae6d6 Compare November 10, 2017 14:26
@nikicc
Copy link
Contributor Author

nikicc commented Nov 10, 2017

@astaric, @markotoplak I split the functionality of array_transform and column_transform into 4 helper functions (see the last two fixup commints 4410596, 3d1a96d). Do you think that's better?

I haven't moved the functions yet, though. I will do this once we settle on the best approach. Perhaps to some util library?

@nikicc nikicc force-pushed the sparsity-from-domain-conversion branch 4 times, most recently from e4affcb to 4aa7120 Compare November 17, 2017 09:36
DomainConversion now has three more attributes `sparse_X`, `sparse_Y` and `sparse_metas`, which suggest whether the resulting matrix should be sparse or dense.
Table.from_table now sets the sparsity of X, Y and metas as determined by DomainConversion.
When we return subarryas, the flag `is_sparse` wasn't considered, but we simpy returned the subarray in it's original format. Also, make sure subarrays aren't flattened to 1d, as it is required for columns.
Calling len on scipy sparse matrices causes the error: `TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]`.
... to alleviate interdependencies between tests in differente files. Before that, one tests could mark some attributes of iris as sparse which could cause tests in different files to crash due to variable "reusing".
@nikicc nikicc force-pushed the sparsity-from-domain-conversion branch from 4aa7120 to 2b8d9a8 Compare November 17, 2017 09:40
@nikicc nikicc force-pushed the sparsity-from-domain-conversion branch from 2b8d9a8 to 4bf6a20 Compare November 17, 2017 11:16
@astaric astaric merged commit 301d13a into biolab:master Nov 20, 2017
@nikicc nikicc deleted the sparsity-from-domain-conversion branch November 20, 2017 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants